With the internet, we are in a new age of data:
Jenny Bryan said: “Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth.”
| Traditional Classroom Data | Real Data |
|---|---|
Some attributes of real data:
Inconsistent formatting is a real pain:
- Dates: “2016/10/12” vs “2016-10-12” vs “10/12/16” vs “10/12/2016” vs “Oct 12, 2016”
- “DC” vs “D.C.” vs “District of Columbia”
- “Beyonce” vs “Beyoncé”
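To see why this is a pain, here is a small base R sketch that reads the date spellings above into one common `Date` value. The format strings are chosen for this illustration, not taken from the slides; notice that every spelling needs its own.

```r
# Each spelling of October 12, 2016 needs its own format string.
d1 <- as.Date("2016/10/12", format = "%Y/%m/%d")
d2 <- as.Date("2016-10-12", format = "%Y-%m-%d")
d3 <- as.Date("10/12/16",   format = "%m/%d/%y")
d4 <- as.Date("10/12/2016", format = "%m/%d/%Y")

# Month abbreviations such as "Oct" are matched using the current locale,
# so this line assumes an English locale (otherwise it may give NA).
d5 <- as.Date("Oct 12, 2016", format = "%b %d, %Y")

# After parsing, the numeric spellings all refer to the same date.
all_dates <- c(d1, d2, d3, d4)
same_date <- length(unique(all_dates)) == 1
```

Text such as “DC” vs “D.C.” has no built-in parser like this, which is part of what makes real data a grizzly bear.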
To tackle this, we now officially introduce the dplyr package: a grammar of data manipulation
function() you use.
Say hello to the 5MV: the five main verbs
- filter() rows/observations matching criteria
- summarise() numerical variables
- group_by() group rows/observations by a categorical variable
- mutate() existing variables to create new ones
- arrange() rows
- _join() two separate data frames by corresponding variables
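A minimal sketch of each verb in action. The data frames `heights` and `towns` are invented for this illustration, and `left_join()` is used as one member of the `_join()` family.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch.
heights <- data.frame(
  name   = c("Ana", "Ben", "Cal", "Dee", "Eli", "Fay"),
  sex    = c("f", "m", "m", "f", "m", "f"),
  inches = c(64, 70, 68, 66, 72, 62)
)
towns <- data.frame(
  name = c("Ana", "Ben", "Cal", "Dee", "Eli", "Fay"),
  town = c("Bristol", "Cornwall", "Bristol", "Ripton", "Cornwall", "Ripton")
)

# filter(): keep rows/observations matching a criterion
tall <- filter(heights, inches >= 66)

# mutate(): use an existing variable to create a new one
heights_cm <- mutate(heights, cm = inches * 2.54)

# group_by() + summarise(): one summary row per group
by_sex <- heights %>%
  group_by(sex) %>%
  summarise(mean_inches = mean(inches))

# arrange(): sort rows, here from tallest to shortest
sorted <- arrange(heights, desc(inches))

# left_join(): match two data frames by a corresponding variable
joined <- left_join(heights, towns, by = "name")
```

The `%>%` pipe reads as "then": take `heights`, then group by `sex`, then summarise.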
- Scatterplot AKA bivariate plot
- Line-graph
- Histogram
- Boxplot
- Barplot AKA Barchart AKA bargraph
Recall from first Grammar of Graphics lecture, we displayed
Say these piecharts represent polls for a local election with 5 candidates at time points A, B, and C:
Answer the following questions:
geom_bar() is the trickiest of the 5NG, so we’ll use it in limited capacity.
Two different ways to have counts show on y-axis:
- Computed internally by geom_bar()
- Precomputed manually by yourself in your data in a variable count, n, etc.
Counts are not pre-computed:
| Row Number | name |
|---|---|
| 1 | Albert |
| 2 | Albert |
| 3 | Albert |
| 4 | Mo |
| 5 | Mo |
Counts are pre-computed in variable n. So n becomes a y aesthetic variable!
| name | n |
|---|---|
| Albert | 3 |
| Mo | 2 |
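The two cases can be sketched with the small tables above. The object names are made up for this example; `geom_bar(stat = "identity")` is one way to tell ggplot2 to use the pre-computed counts as bar heights.

```r
library(ggplot2)

# Counts NOT pre-computed: one row per observation.
names_raw <- data.frame(name = c("Albert", "Albert", "Albert", "Mo", "Mo"))

# geom_bar() counts the rows for each name internally: no y aesthetic.
plot_raw <- ggplot(data = names_raw, aes(x = name)) +
  geom_bar()

# Counts pre-computed in variable n.
names_counted <- data.frame(name = c("Albert", "Mo"), n = c(3, 2))

# n is mapped to the y aesthetic; stat = "identity" means "use n as is".
plot_counted <- ggplot(data = names_counted, aes(x = name, y = n)) +
  geom_bar(stat = "identity")
```

Both plots should show the same bars. `geom_col()` is a shortcut for the second version.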
- In-class Wed 3/8
- Closed book, no calculators
ggplot2?
- Scatterplot AKA bivariate plot
- Line-graph
- Histogram
- Boxplot
- Barplot AKA Barchart AKA bargraph
If I know your name, I can guess your age. Looking at the handout, answer the following questions:
As of Jan 1st, 2014 in the United States
- What can you say about females named Ella vs Zoe?
- What can you say about males named Aidan vs Oliver?
- What proportion of male Connors are younger than 16?
- What proportion of female Gertrudes are older than 69?
Chalk Talk: Age of 544 Members of 113th United States Congress:
- 439 members of House of Representatives
- 105 Senators
- Scatterplot AKA bivariate plot
- Line-graph
- Histogram
- Boxplot
- Barplot AKA Barchart AKA bargraph
From okcupiddata package, the profiles data set:
Restricted to heights between 55 (4’7”) and 80 (6’8”) inches:
- The y-axis displays notions of relative frequency i.e. which values occur more than others.
- Huge definition: they are a visualization of the statistical distribution of values.
- We have an x aesthetic
- Counts on the y-axis are not an explicit variable in the data set, but rather are computed internally, i.e. no y aesthetic
- The shape of a histogram is dependent on the structure of the bins on the x-axis.
For values: \(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5\)
Let’s draw histograms using the following binning structures:
- (-3, -2, -1, 0, 1, 2, 3)
- (-4, -2, 0, 2, 4)
- (-4, 4)
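To check your drawings, here is a base R sketch that counts how many values fall in each bin. `hist()` with `plot = FALSE` returns the counts without drawing anything; by default each bin includes its right endpoint.

```r
x <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)

# Same six values, three different binning structures.
counts_a <- hist(x, breaks = c(-3, -2, -1, 0, 1, 2, 3), plot = FALSE)$counts
counts_b <- hist(x, breaks = c(-4, -2, 0, 2, 4), plot = FALSE)$counts
counts_c <- hist(x, breaks = c(-4, 4), plot = FALSE)$counts

counts_a  # 1 1 1 1 1 1
counts_b  # 1 2 2 1
counts_c  # 6
```

Same data, three different shapes: flat, peaked in the middle, and a single block.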
Facets allow you to split ANY plot by a categorical variable. In this case, by adding + facet_wrap(~sex) to the ggplot() call
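A minimal sketch of faceting a histogram. To keep it self-contained, it uses simulated heights in a made-up data frame `fake_profiles` rather than the real profiles data; the means and spreads are arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for the profiles data: simulated heights in inches.
set.seed(76)
fake_profiles <- data.frame(
  sex    = rep(c("f", "m"), each = 200),
  height = c(rnorm(200, mean = 65, sd = 3), rnorm(200, mean = 70, sd = 3))
)

# One histogram of all heights; binwidth = 1 gives one-inch bins.
p <- ggplot(data = fake_profiles, aes(x = height)) +
  geom_histogram(binwidth = 1)

# Adding facet_wrap(~sex) splits the same plot into one panel per sex.
p_faceted <- p + facet_wrap(~sex)
```

Nothing else about the plot changes: same data, same aesthetic, same geom, just split into panels.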
- Scatterplot AKA bivariate plot
- Line-graph
- Histogram
- Boxplot
- Barplot AKA Barchart AKA bargraph
A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.
ggplot(data=simple_ex, aes(x=A, y=B, size=C, color=D )) +
geom_line()
- Scatterplot AKA bivariate plot
- Line-graph
- Histogram
- Boxplot
- Barplot AKA Barchart AKA bargraph
What’s not great about this plot, especially near (0, 0)?
This is called overplotting: when points are stacked so densely we can’t see what’s going on!
There are two ways of dealing with this:
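Two remedies commonly used in ggplot2 are transparency and jittering; treating these as the two ways meant here is an assumption of this sketch, and the data frame `stacked` is invented to force overplotting.

```r
library(ggplot2)

# Hypothetical data with many repeated (x, y) pairs, so points stack up.
set.seed(76)
stacked <- data.frame(
  x = sample(0:5, size = 500, replace = TRUE),
  y = sample(0:5, size = 500, replace = TRUE)
)

# Remedy 1: transparency. Darker spots mean more points on top of each other.
p_alpha <- ggplot(data = stacked, aes(x = x, y = y)) +
  geom_point(alpha = 0.1)

# Remedy 2: jitter. Each point is nudged by a little random noise.
p_jitter <- ggplot(data = stacked, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.2)
```

The values of `alpha`, `width`, and `height` are arbitrary here; in practice you adjust them until the pattern becomes visible.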
A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.
The five named graphs we’ll see in this class. Note: I reordered them from last time to be easiest to hardest to work with:
- Scatterplot AKA bivariate plot
- Line-graph
- Histogram
- Boxplot
- Barplot AKA Barchart AKA bargraph
ggplot2 package
In tidy format:
| A | B | C | D |
|---|---|---|---|
| 1 | 1 | 3 | Hot |
| 2 | 2 | 2 | Hot |
| 3 | 3 | 1 | Cold |
| 4 | 4 | 2 | Cold |
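The table can be typed into R and plotted as a sketch of the mapping idea. The name `simple_ex` follows the ggplot() call in these slides; using `geom_point()` here is just one choice of geom.

```r
library(ggplot2)

# The four-row table above, typed in as a data frame.
simple_ex <- data.frame(
  A = c(1, 2, 3, 4),
  B = c(1, 2, 3, 4),
  C = c(3, 2, 1, 2),
  D = c("Hot", "Hot", "Cold", "Cold")
)

# Map A to x, B to y, C to size, and D to color, then draw points.
p <- ggplot(data = simple_ex, aes(x = A, y = B, size = C, color = D)) +
  geom_point()
```

Four variables are shown on a 2D page: two through position, one through size, one through color.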
In 1812, Napoleon led a French invasion of Russia, marching on Moscow.
It was one of the biggest military disasters ever, in particular b/c of the Russian winter.
Famous graphical illustration of Napoleon’s march to/from Moscow
This was considered a revolution in statistical graphics because between
- the map on top
- the line graph on the bottom
there are 6 dimensions of information (i.e. variables) being displayed on a 2D page.
A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.
| Where? | data | aes() | geom_ |
|---|---|---|---|
| top map | longitude | x | point |
| “ | latitude | y | point |
| “ | army size | size | path |
| “ | army direction (forward vs retreat) | color | path |
| bottom graph | date | x | line & text |
| “ | temperature | y | line & text |
| 2005 - Proposal | 2009 - R Implementation |
|---|---|
From ggplot2movies package, the movies data set:
From nycflights13 package, the flights data set:
From okcupiddata package, the profiles data set:
From fueleconomy package, the vehicles data set:
From babynames package, the babynames data set:
Say hello to the 5NG: the five named graphs
The nycflights13 package contains “tidy data” on all 336,776 flights that departed from NYC (e.g. EWR, JFK and LGA) in 2013.
To help understand what causes delays, it also includes a number of other useful datasets.
- weather: hourly meteorological data for each airport
- planes: construction information about each plane
- airports: airport names and locations
- airlines: translation between two letter carrier codes and names
In small teams, take 3 minutes to write down
Recall the tradeoff:
| Less of this… | More of this… |
|---|---|
You need to install each package once.
You need to load a package every time you want to use it.
library(PACKAGENAME) in the console.
Today’s Learning Check: Install and then load 3 packages:
- dplyr: a package for data manipulation
- ggplot2: a package for data visualization
- babynames: a package of baby name data
babynames Package
The babynames package contains, for each year from 1880 to 2013, the number of children born of each sex given each name in the United States. Only names with more than 5 occurrences are considered.
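A sketch of the install-once, load-every-time pattern for these three packages, followed by a first look at the data. The install line is commented out so it does not re-run each time; the name "Ella" is just an example.

```r
# Install once per computer (uncomment to run; needs an internet connection).
# install.packages(c("dplyr", "ggplot2", "babynames"))

# Load every time you start a new R session and want to use them.
library(dplyr)
library(ggplot2)
library(babynames)

# babynames has one row per year, sex, and name; n is the number of children.
ella <- filter(babynames, name == "Ella", sex == "F")
head(ella)
```

If `library()` complains that there is no such package, the install step has not been done yet on that computer.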
Have students engage in the data/science research pipeline in as faithful a manner as possible while maintaining a level suitable for novices.
We will, as best we can, perform all this:
And not just this, as in many previous intro stats courses:
Foster a conceptual understanding of statistical topics and methods using simulation/resampling and real data whenever possible, rather than mathematical formulae.
In this course, computers and not math will be the “engine”. What does this mean?
Blur the traditional lecture/lab dichotomy of introductory statistics courses by incorporating more computational and algorithmic thinking into the syllabus.
go/rstudio/ (on campus or via VPN)
Develop statistical literacy by, among other ways, tying in the curriculum to current events, demonstrating the importance statistics plays in society.
Either
| R | RStudio | DataCamp |
|---|---|---|
- Login to go/rstudio/ with your Midd account
- If you don’t have access, raise your hand. (Username: guest1, password: rstudioguest)
- In RStudio menu bar -> File -> New File -> R Script
- This is where you run/execute commands
- The “>” is the prompt. It means R is ready to receive commands
- If you don’t see a “>” and want to restart, press ESC.
Now we will use R via DataCamp instead of via RStudio, but just for driver’s ed. Two panels exist in both: